1. Time series background


The definitions and notation we adopt for the functions
used to describe a zero mean stationary Gaussian discrete
parameter time series $Y(t)$, $t = 0, \pm 1, \ldots$, are as follows.

A "time domain" specification of the probability law of
$Y(\cdot)$ is provided by the covariance function

$R(v) = E[Y(t)Y(t+v)], \quad v = 0, \pm 1, \pm 2, \ldots;$

or by the variance $R(0)$ and the correlation function

$\rho(v) = \frac{R(v)}{R(0)} = \mathrm{Corr}[Y(t), Y(t+v)].$

To define a spectral (frequency) domain specification of
the probability law of $Y(\cdot)$ we first assume summability of
$R(\cdot)$ and $\rho(\cdot)$. The Fourier transforms of $R(v)$ and $\rho(v)$
are called the power spectrum $S(\omega)$ and spectral density
function $f(\omega)$ respectively, and are defined by

$S(\omega) = \sum_{v=-\infty}^{\infty} e^{-2\pi i v \omega} R(v), \quad 0 \le \omega \le 1;$

$f(\omega) = \sum_{v=-\infty}^{\infty} e^{-2\pi i v \omega} \rho(v), \quad 0 \le \omega \le 1.$
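The definition of $f(\omega)$ can be checked numerically. The sketch below is an illustration only: it assumes a hypothetical correlation function $\rho(v) = \phi^{|v|}$ (not taken from the text), truncates the Fourier sum at a finite lag, and compares the result with the closed form $(1-\phi^2)/(1 - 2\phi\cos 2\pi\omega + \phi^2)$ that this particular choice is known to have.

```python
import numpy as np

def spectral_density(rho, omega, max_lag):
    """Truncated sum f(w) = sum_{|v| <= max_lag} exp(-2 pi i v w) rho(v)."""
    lags = np.arange(-max_lag, max_lag + 1)
    # The sum is real because rho(-v) = rho(v); keep only the real part.
    terms = np.exp(-2j * np.pi * np.outer(omega, lags)) * rho(lags)
    return terms.sum(axis=1).real

phi = 0.6                                  # hypothetical parameter, |phi| < 1
rho = lambda v: phi ** np.abs(v)           # assumed summable correlation function
omega = np.linspace(0.0, 1.0, 201)

f_sum = spectral_density(rho, omega, max_lag=200)
f_closed = (1 - phi**2) / (1 - 2 * phi * np.cos(2 * np.pi * omega) + phi**2)
max_error = np.max(np.abs(f_sum - f_closed))
print(max_error)
```

Because $\rho(v)$ decays geometrically here, the truncation error is negligible; a slowly decaying correlation function would need a much larger `max_lag`.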


A spectral density is called an autoregressive spectral
density when it can be expressed in the form of eq. (3.9) below.
Autoregressive spectral densities are used for nonparametric
estimation of spectral densities, and for time series model
identification.

Parzen (1982) proposes that it is useful in practice to
distinguish qualitatively between three types of time series:

no memory: white noise,

short memory: stationary and ergodic,

long memory: non-stationary or non-ergodic.

A no-memory or white noise time series is a stationary
Gaussian time series satisfying either of the equivalent
conditions: $\rho(v) = 0$ for $v > 0$; $f(\omega) = 1$, $0 \le \omega \le 1$.

A short memory time series is a stationary time series
possessing a summable correlation function $\rho(v)$ and a spectral
density $f(\omega)$ which is bounded above and below in the sense that
the dynamic range of $f(\omega)$,

$\mathrm{DR}(f) = \left\{ \max_{0 \le \omega \le 1} f(\omega) \right\} \div \left\{ \min_{0 \le \omega \le 1} f(\omega) \right\},$

satisfies $1 < \mathrm{DR}(f) < \infty$. Then $f(\omega)$ can be shown to be
representable as the limit of a sequence of autoregressive
spectral densities $\bar f_m(\omega)$.
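As a small illustration of the dynamic range, the sketch below evaluates $\mathrm{DR}(f)$ on a frequency grid for two spectral densities: white noise, and a hypothetical first-order autoregressive density with parameter $\phi = 0.6$ (an assumption made for this example, not a case discussed in the text).

```python
import numpy as np

def dynamic_range(f_values):
    """Ratio of the largest to the smallest value of f on the grid."""
    return np.max(f_values) / np.min(f_values)

omega = np.linspace(0.0, 1.0, 1001)        # grid containing 0 and 0.5
phi = 0.6                                  # hypothetical parameter

f_white = np.ones_like(omega)              # white noise: f(w) = 1
f_ar1 = (1 - phi**2) / (1 - 2 * phi * np.cos(2 * np.pi * omega) + phi**2)

dr_white = dynamic_range(f_white)
dr_ar1 = dynamic_range(f_ar1)
print(dr_white, dr_ar1)
```

For this autoregressive example the maximum is at $\omega = 0$ and the minimum at $\omega = 1/2$, so the dynamic range is $((1+\phi)/(1-\phi))^2$, which is finite and greater than 1.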
A long memory time series is one which is neither no
memory nor short memory; alternatively, a long memory time series
is one which is non-stationary or non-ergodic. It usually has
components representing cycles or trends.

2. Entropy and exponential models

The notion of entropy in statistics is usually first defined
for a discrete distribution with probability mass function $p(x)$.
The entropy of this distribution, denoted $H(p)$, is defined by

(1)  $H(p) = -\sum_{x} p(x) \log p(x).$

For the distribution of a continuous real valued random variable
$X$, with probability density function $f(x)$, entropy is defined
(analogously or formally) by

(2)  $H(f) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx.$
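Definitions (1) and (2) can be evaluated directly. The following sketch is illustrative only: the uniform four-point distribution and the Gaussian density with standard deviation 1.5 are assumptions chosen because their entropies have known closed forms ($\log 4$ and $\tfrac12 \log(2\pi e \sigma^2)$), and the integral in (2) is approximated by a Riemann sum on a finite grid.

```python
import numpy as np

def discrete_entropy(p):
    """H(p) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def differential_entropy(f_values, dx):
    """Riemann-sum approximation of H(f) = -int f(x) log f(x) dx."""
    f_values = f_values[f_values > 0]
    return -np.sum(f_values * np.log(f_values)) * dx

h_uniform = discrete_entropy([0.25, 0.25, 0.25, 0.25])

sigma = 1.5                                 # hypothetical standard deviation
x = np.linspace(-15.0, 15.0, 30001)
dx = x[1] - x[0]
gauss = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

h_gauss_numeric = differential_entropy(gauss, dx)
h_gauss_exact = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(h_uniform, h_gauss_numeric, h_gauss_exact)
```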

A concept closely related to entropy is information
divergence $I(f;g)$ between two probability densities $f(x)$ and
$g(x)$, defined by

(3)  $I(f;g) = \int_{-\infty}^{\infty} \left\{ -\log \frac{g(x)}{f(x)} \right\} f(x) \, dx.$

The measure (3) is called by statisticians the Kullback-
Leibler number because it was introduced into statistics in
Kullback and Leibler (1951). It seems that a more correct name
for (3) would be the Kullback number, as the concept of the use
of these numbers in statistical inference, as in Kullback (1959),
is entirely due to Kullback.


One should note that $I(f;g)$ equals minus the generalized
entropy $H(f \mid g)$ defined by

(4)  $H(f \mid g) = \int_{-\infty}^{\infty} \left\{ -\frac{f(x)}{g(x)} \log \frac{f(x)}{g(x)} \right\} g(x) \, dx.$

Another fundamental concept is cross-entropy, defined by

(5)  $H(f;g) = \int_{-\infty}^{\infty} \{ -\log g(x) \} f(x) \, dx.$

Note that $H(f) = H(f;f)$.

Information divergence is expressed in terms of cross-entropy
and entropy by

(6)  $I(f;g) = H(f;g) - H(f).$

Important Information Inequality:

(7)  $I(f;g) \ge 0$

with equality if and only if $f = g$; consequently

(8)  $H(f) \le H(f;g).$
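The relations (6) to (8) can be checked numerically for a concrete pair of densities. The sketch below assumes, purely for illustration, that $f$ is the $N(0,1)$ density and $g$ is the $N(1, 2^2)$ density, and approximates the integrals by Riemann sums on a wide grid. The closed-form divergence between two Gaussians used for comparison is a standard result, not something stated in the text.

```python
import numpy as np

def gaussian_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

x = np.linspace(-30.0, 30.0, 60001)
dx = x[1] - x[0]
f = gaussian_pdf(x, 0.0, 1.0)      # "true" density (assumed for illustration)
g = gaussian_pdf(x, 1.0, 2.0)      # model density (assumed for illustration)

cross_entropy = np.sum(-np.log(g) * f) * dx        # H(f;g), eq. (5)
entropy = np.sum(-np.log(f) * f) * dx              # H(f) = H(f;f)
divergence = np.sum(-np.log(g / f) * f) * dx       # I(f;g), eq. (3)

# Standard closed form for the divergence between two Gaussian densities.
closed_form = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0) ** 2) / (2 * 2.0**2) - 0.5
print(divergence, cross_entropy - entropy, closed_form)
```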

Some applications of entropy in probability and statistical
modeling are now described.

The method of maximum likelihood parameter estimation can
be described abstractly as follows. One introduces a parametric
family of probability densities $f_\theta(x)$, indexed by a vector
parameter $\theta = (\theta_1, \ldots, \theta_k)$. Suppose there is a true parameter
value $\bar\theta$ in the sense that the true probability density
$f(x) = f_{\bar\theta}(x)$. Then $\bar\theta$ satisfies

(10)  $H(f) = H(f; f_{\bar\theta}) = \min_{\theta} H(f; f_\theta).$

To estimate $\bar\theta$ from data, one forms an estimator $\hat H(f; f_\theta)$ of
$H(f; f_\theta)$ and defines an estimator $\hat\theta$ of $\bar\theta$ by

(11)  $\hat H(f; f_{\hat\theta}) = \min_{\theta} \hat H(f; f_\theta).$

The estimator $\hat H(f; f_\theta)$ could be of the form

(12)  $\hat H(f; f_\theta) = H(\tilde f; f_\theta)$

for a suitable raw estimator $\tilde f(x)$ of $f(x)$.
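One way to make (11) and (12) concrete is to take the raw estimator to be the empirical distribution of the sample, in which case the estimated cross-entropy becomes minus the average log-likelihood. That choice, the $N(\theta, 1)$ family, the simulated data, and the grid search are all assumptions made for this sketch; the text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.0, size=500)   # simulated data (assumption)

def estimated_cross_entropy(theta, sample):
    """-(1/n) sum_i log f_theta(x_i) for the N(theta, 1) family.

    This is H(f~; f_theta) when f~ is the empirical distribution.
    """
    return np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (sample - theta) ** 2)

# Minimize the estimated cross-entropy over a grid of candidate parameters.
grid = np.linspace(0.0, 4.0, 4001)
values = np.array([estimated_cross_entropy(t, sample) for t in grid])
theta_hat = grid[np.argmin(values)]

print(theta_hat, sample.mean())
```

For this family the minimizer is the sample mean, which is also the maximum likelihood estimator, so the grid minimizer agrees with `sample.mean()` up to the grid spacing.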

The parametric families of probability densities $f_\theta(x)$
are often derived axiomatically using a maximum entropy principle.

Natural exponential models: A parametric family of probability
densities $f_\theta(x)$ is said to obey a natural exponential model
when it is of the form

(13)  $\log f_\theta(x) = \sum_{j=1}^{k} \theta_j T_j(x) - \Psi(\theta_1, \ldots, \theta_k)$

where

(14)  $\Psi(\theta_1, \ldots, \theta_k) = \log \int_{-\infty}^{\infty} dx \, \exp \sum_{j=1}^{k} \theta_j T_j(x).$


Natural exponential models are maximum entropy probability
densities in the sense of the following theorem [see Guiasu
(1977) and Kagan, Linnik, and Rao (1973), p. 409]. Fix $k$
functions $T_j(x)$, $j = 1, 2, \ldots, k$, and $k$ real numbers $\tau_1, \ldots, \tau_k$
such that there exist probability densities $f(x)$ satisfying

(15)  $\int_{-\infty}^{\infty} T_j(x) f(x) \, dx = \tau_j, \quad j = 1, \ldots, k.$

The density with maximum entropy $H(f)$ among these densities is
of the form (13) where $\theta_1, \ldots, \theta_k$ are chosen to satisfy

(16)  $\int_{-\infty}^{\infty} T_j(x) f_\theta(x) \, dx = \tau_j, \quad j = 1, \ldots, k.$


The aim of this paper is to present a new proof of the
maximum entropy character of autoregressive spectral densities
which is analogous to the simple proof of the maximum entropy
character of exponential models for probability densities.

We recall the latter. Verify that for any $f(x)$ satisfying the
moment constraints (15)

(17)  $H(f; f_\theta) = \Psi(\theta_1, \ldots, \theta_k) - \sum_{j=1}^{k} \theta_j \tau_j = H(f_\theta),$

and therefore

(18)  $H(f) \le H(f; f_\theta) = H(f_\theta).$

Thus the maximum entropy is achieved by $f_\theta(x)$.
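The argument in (17) and (18) can be illustrated with the familiar case $T_1(x) = x$, $T_2(x) = x^2$, $\tau_1 = 0$, $\tau_2 = 1$, for which the exponential-model density is the standard Gaussian. In the sketch below this choice of constraints, and the Laplace density used as a competing $f$ with the same two moments, are assumptions for illustration; integrals are approximated by Riemann sums.

```python
import numpy as np

x = np.linspace(-30.0, 30.0, 60001)
dx = x[1] - x[0]

def integrate(values):
    return np.sum(values) * dx

# Exponential-model density for T_1 = x, T_2 = x^2: the standard Gaussian.
gauss = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

# A competing density with the same mean (0) and second moment (1).
b = 1.0 / np.sqrt(2.0)                     # Laplace scale giving variance 2 b^2 = 1
laplace = np.exp(-np.abs(x) / b) / (2 * b)

second_moment = integrate(x**2 * laplace)
h_gauss = integrate(-np.log(gauss) * gauss)        # H(f_theta)
h_laplace = integrate(-np.log(laplace) * laplace)  # H(f)
h_cross = integrate(-np.log(gauss) * laplace)      # H(f; f_theta)

print(second_moment, h_laplace, h_cross, h_gauss)
```

The cross-entropy of the Laplace density against the Gaussian equals the Gaussian entropy, as in (17), because $-\log$ of the Gaussian density is a linear combination of the constrained functions; the Laplace entropy is strictly smaller, as in (18).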
3. Entropy of spectral density functions

To extend entropy concepts to short memory stationary
zero mean Gaussian time series, define the information
divergence for a sample $Y(t)$, $t = 1, 2, \ldots, T$, as a function of
the true probability density $f$ of the sample, and a model $g$ for
$f$. We define

(1)  $I(f;g) = \lim_{T \to \infty} I_T(f;g),$

(2)  $I_T(f;g) = -\frac{1}{T} E_f \left[ \log \frac{g(Y(1), \ldots, Y(T))}{f(Y(1), \ldots, Y(T))} \right].$

It should be noted that we are using the notation $f$ and $g$
with a variety of meanings. For a Gaussian zero mean stationary
time series, the probability density of the sample is specified
by the spectral densities $f(\omega)$ of the true distribution
and $g(\omega)$ of the model. The arguments of the information
divergence $I(f;g)$ indicate spectral densities in the following
discussion. Pinsker (1963) derives the following very important
formula:

(3)  $I(f;g) = \frac{1}{2} \int_0^1 \left\{ \frac{f(\omega)}{g(\omega)} - \log \frac{f(\omega)}{g(\omega)} - 1 \right\} d\omega.$

Since $u - \log u - 1 \ge 0$ for all $u > 0$, $I$ has two of the properties
of a distance: $I(f;g) \ge 0$, $I(f;f) = 0$. However $I$ does not
satisfy the triangle inequality.

We define the cross-entropy of spectral density functions
$f(\omega)$ and $g(\omega)$ by

(4)  $H(f;g) = \frac{1}{2} \int_0^1 \left\{ \log g(\omega) + \frac{f(\omega)}{g(\omega)} \right\} d\omega.$

The entropy of $f$ is

(5)  $H(f) = H(f;f) = \frac{1}{2} \int_0^1 \{ \log f(\omega) + 1 \} \, d\omega.$

Information divergence can be expressed

(6)  $I(f;g) = H(f;g) - H(f).$

Hence

(7)  $H(f) \le H(f;g).$
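Formulas (3) to (7) for spectral densities are easy to evaluate on a frequency grid. The sketch below uses two hypothetical first-order autoregressive spectral densities, with parameters 0.6 and -0.3 chosen only for illustration, and approximates $\int_0^1$ by an average over a midpoint grid. The closed form $\tfrac12\{\log(1-\phi^2) + 1\}$ used for the entropy check holds for this particular family.

```python
import numpy as np

n = 4096
omega = (np.arange(n) + 0.5) / n           # midpoint grid on [0, 1]

def ar1_density(phi):
    """Spectral density with correlation function rho(v) = phi^|v| (assumed example)."""
    return (1 - phi**2) / (1 - 2 * phi * np.cos(2 * np.pi * omega) + phi**2)

def divergence(f, g):
    u = f / g
    return 0.5 * np.mean(u - np.log(u) - 1)          # eq. (3)

def cross_entropy(f, g):
    return 0.5 * np.mean(np.log(g) + f / g)          # eq. (4)

def entropy(f):
    return cross_entropy(f, f)                       # eq. (5)

f = ar1_density(0.6)
g = ar1_density(-0.3)

i_fg = divergence(f, g)
h_fg = cross_entropy(f, g)
h_f = entropy(f)
print(i_fg, h_fg - h_f, h_f)
```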

An approximating autoregressive spectral density of order
$m$, denoted $\bar f_m(\omega)$, to a spectral density $f(\omega)$ is defined by

(8)  $H(f; \bar f_m) = \min_{f_m} H(f; f_m)$

where the minimization is over all $f_m$ of the form

(9)  $f_m(\omega) = \sigma_m^2 \left| g_m(e^{2\pi i \omega}) \right|^{-2},$

(10)  $g_m(z) = 1 + \alpha_m(1) z + \cdots + \alpha_m(m) z^m.$
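One may verify that

(11)  $H(f; f_m) = \frac{1}{2} \left\{ \log \sigma_m^2 + \frac{1}{\sigma_m^2} \int_0^1 \left| g_m(e^{2\pi i \omega}) \right|^2 f(\omega) \, d\omega \right\}.$

The coefficients $\bar\sigma_m^2, \bar\alpha_m(1), \ldots, \bar\alpha_m(m)$ of the minimum
cross-entropy approximating autoregressive spectral density satisfy

(12)  $\bar\sigma_m^2 = \int_0^1 \left| \bar g_m(e^{2\pi i \omega}) \right|^2 f(\omega) \, d\omega = \int_0^1 \bar g_m(e^{2\pi i \omega}) f(\omega) \, d\omega = \min_{g_m} \int_0^1 \left| g_m(e^{2\pi i \omega}) \right|^2 f(\omega) \, d\omega,$

(13)  $\int_0^1 \bar g_m(e^{2\pi i \omega}) e^{-2\pi i k \omega} f(\omega) \, d\omega = \sum_{j=0}^{m} \bar\alpha_m(j) \rho(j-k) = 0, \quad k = 1, 2, \ldots, m,$

where $\bar\alpha_m(0) = 1$.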
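Equations (13) form a linear system in the coefficients, and (12) then gives the innovation variance as $\sum_{j=0}^{m} \alpha_m(j)\rho(j)$ with $\alpha_m(0) = 1$. The sketch below solves that system with NumPy; the function name `yule_walker` and the test correlation $\rho(v) = 0.6^{|v|}$ are assumptions made for this example.

```python
import numpy as np

def yule_walker(rho, m):
    """Solve sum_{j=0}^m alpha(j) rho(j-k) = 0, k = 1..m, with alpha(0) = 1.

    rho is an array with rho[v] for v = 0..m and rho[0] = 1.
    Returns (alpha(1..m), sigma^2).
    """
    rho = np.asarray(rho, dtype=float)
    idx = np.arange(1, m + 1)
    # Matrix entry (k, j) is rho(j - k) = rho(|j - k|).
    matrix = rho[np.abs(np.subtract.outer(idx, idx))]
    alpha = np.linalg.solve(matrix, -rho[1:m + 1])
    sigma2 = rho[0] + alpha @ rho[1:m + 1]   # sum_{j=0}^m alpha(j) rho(j)
    return alpha, sigma2

phi = 0.6                                    # hypothetical parameter
rho = phi ** np.arange(0, 3)                 # rho(0), rho(1), rho(2)
alpha, sigma2 = yule_walker(rho, m=2)
print(alpha, sigma2)
```

For this correlation function the order-2 fit reduces to order 1: the first coefficient is $-\phi$, the second vanishes, and $\sigma^2 = 1 - \phi^2$.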

Further

(14)  $H(f; \bar f_m) = \frac{1}{2} \left\{ \log \bar\sigma_m^2 + 1 \right\} = H(\bar f_m).$

The autoregressive spectral density $\bar f_m(\omega)$ can be derived
axiomatically using a maximum entropy principle.
Theorem: The spectral density with maximum entropy among
all spectral densities $f(\omega)$ satisfying the constraints

(15)  $\int_0^1 e^{2\pi i \omega j} f(\omega) \, d\omega = \rho(j), \quad j = 1, 2, \ldots, m,$

for $m$ specified correlation coefficients $\rho(1), \ldots, \rho(m)$ is
$\bar f_m(\omega)$ whose coefficients are determined by (12) and (13).

Proof: It may be verified that $\bar f_m(\omega)$ satisfies the
constraints (15), and (14) holds for any $f(\omega)$ satisfying (15).
Since

(16)  $H(f) \le H(f; \bar f_m) = H(\bar f_m),$

it follows that $\bar f_m$ has maximum entropy among all spectral
densities satisfying the constraints (15).
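The theorem can be illustrated numerically for $m = 1$. The sketch below takes, as an assumed example, the spectral density $f(\omega) = 1 + 2r\cos 2\pi\omega$ with $r = 0.4$, whose only nonzero correlation at positive lags is $\rho(1) = r$, and compares it with the first-order autoregressive density having the same $\rho(1)$. Integrals over $[0,1]$ are approximated by averages over a midpoint grid.

```python
import numpy as np

n = 4096
omega = (np.arange(n) + 0.5) / n            # midpoint grid on [0, 1]
r = 0.4                                     # hypothetical rho(1), |r| < 1/2 keeps f > 0

# A competing spectral density satisfying the constraint rho(1) = r.
f = 1 + 2 * r * np.cos(2 * np.pi * omega)

# Order-1 autoregressive density matching rho(1) = r:
# alpha(1) = -r and sigma^2 = 1 - r^2 from (12) and (13).
f_bar = (1 - r**2) / np.abs(1 - r * np.exp(2j * np.pi * omega)) ** 2

def cross_entropy(f, g):
    return 0.5 * np.mean(np.log(g) + f / g)

rho1_f = np.mean(np.cos(2 * np.pi * omega) * f)         # constraint (15) for f
rho1_f_bar = np.mean(np.cos(2 * np.pi * omega) * f_bar) # constraint (15) for f_bar

h_f = cross_entropy(f, f)
h_cross = cross_entropy(f, f_bar)
h_bar = cross_entropy(f_bar, f_bar)
print(h_f, h_cross, h_bar)
```

Both densities satisfy the constraint, the cross-entropy $H(f; \bar f_1)$ coincides with $H(\bar f_1)$ as in (14), and the entropy of $f$ is strictly smaller, as in (16).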

A proof of this theorem based on prediction theory (due
to Akaike) is given in Priestley (1981), p. 605. Our proof
has the attraction of emphasizing the parallels between
exponential densities and autoregressive densities.

4. Extension to entropy of density-quantile functions

Parzen (1979) uses autoregressive densities to model
quantile density functions

$q(u) = \{ F^{-1}(u) \}' = \{ f(F^{-1}(u)) \}^{-1}.$

The estimators derived may be shown to be maximum entropy
estimators under the constraints imposed. This follows from
the fact that the entropy of a probability density function
$f(x)$ can be expressed

$H(f) = \int_0^1 \log q(u) \, du = \int_0^1 -\log f(F^{-1}(u)) \, du.$

These integrals are defined to be the entropy of the quantile
density function and density-quantile function respectively.
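The identity between the entropy of a density and the two quantile-domain integrals can be checked on a distribution with an explicit quantile function. The sketch below assumes the unit exponential distribution (an illustration, not an example from the text), for which $F^{-1}(u) = -\log(1-u)$, $q(u) = 1/(1-u)$, and the entropy is known to equal 1. The integrals are approximated on a midpoint grid, which avoids the endpoint $u = 1$ where $q$ is unbounded.

```python
import numpy as np

n = 100_000
u = (np.arange(n) + 0.5) / n                # midpoint grid on (0, 1)

def quantile(u):
    """F^{-1}(u) for the unit exponential distribution."""
    return -np.log1p(-u)

def density(x):
    """f(x) = exp(-x) for x >= 0."""
    return np.exp(-x)

q = 1.0 / (1.0 - u)                         # quantile density q(u)
fq = density(quantile(u))                   # density-quantile function f(F^{-1}(u))

h_from_q = np.mean(np.log(q))               # integral of log q(u)
h_from_fq = np.mean(-np.log(fq))            # integral of -log f(F^{-1}(u))
print(h_from_q, h_from_fq)
```

Because the integrand has a logarithmic singularity at $u = 1$, the midpoint approximation converges only at rate $1/n$, so the agreement with the exact value 1 is to a few decimal places rather than to machine precision.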